{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/advanced_techniques/langchain_parent_document_retrieval.ipynb)\n",
        "\n",
        "[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/developer/products/atlas/parent-doc-retrieval/?utm_campaign=devrel&utm_source=cross-post&utm_medium=organic_social&utm_content=https%3A%2F%2Fgithub.com%2Fmongodb-developer%2FGenAI-Showcase&utm_term=apoorva.joshi)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Parent Document Retrieval Using MongoDB and LangChain\n",
        "\n",
        "This notebook shows you how to implement parent document retrieval in your RAG application using MongoDB's LangChain integration."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 1: Install required libraries\n",
        "\n",
        "- **datasets**: Python package to download datasets from Hugging Face\n",
        "\n",
        "- **pymongo**: Python driver for MongoDB\n",
        "\n",
        "- **langchain**: Python package for LangChain's core modules\n",
        "\n",
        "- **langchain-openai**: Python package to use OpenAI models via LangChain\n",
        "\n",
        "- **langgraph**: Python package to orchestrate LLM workflows as graphs\n",
        "\n",
        "- **langchain-mongodb**: Python package to use MongoDB features in LangChain\n",
        "\n",
        "- **langchain-openai**: Python package to use OpenAI models via LangChain"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [],
      "source": [
        "! pip install -qU datasets pymongo langchain langgraph langchain-mongodb langchain-openai"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 2: Setup prerequisites\n",
        "\n",
        "- **Set the MongoDB connection string**: Follow the steps [here](https://www.mongodb.com/docs/manual/reference/connection-string/) to get the connection string from the Atlas UI.\n",
        "\n",
        "- **Set the OpenAI API key**: Steps to obtain an API key are [here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key)\n",
        "\n",
        "- **Set the Hugging Face token**: Steps to create a token are [here](https://huggingface.co/docs/hub/en/security-tokens#how-to-manage-user-access-tokens). You only need **read** token for this tutorial."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {},
      "outputs": [],
      "source": [
        "import getpass\n",
        "import os\n",
        "\n",
        "from pymongo import MongoClient"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {},
      "outputs": [],
      "source": [
        "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API Key:\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'ok': 1.0,\n",
              " '$clusterTime': {'clusterTime': Timestamp(1734037711, 1),\n",
              "  'signature': {'hash': b'v\\xa2\\xc7\\xf6\\xc4\\xc5z\\x97%Q_\\xc1\\xa5\\xaf}\\x05(\\x92\\x80\\xc2',\n",
              "   'keyId': 7390069253761662978}},\n",
              " 'operationTime': Timestamp(1734037711, 1)}"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "MONGODB_URI = getpass.getpass(\"Enter your MongoDB connection string:\")\n",
        "mongodb_client = MongoClient(\n",
        "    MONGODB_URI, appname=\"devrel.showcase.parent_doc_retrieval\"\n",
        ")\n",
        "mongodb_client.admin.command(\"ping\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {},
      "outputs": [],
      "source": [
        "os.environ[\"HF_TOKEN\"] = getpass.getpass(\"Enter your HF Access Token:\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 3: Load the dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {},
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/Users/apoorva.joshi/Documents/GenAI-Showcase/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
            "  from .autonotebook import tqdm as notebook_tqdm\n"
          ]
        }
      ],
      "source": [
        "import pandas as pd\n",
        "from datasets import load_dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {},
      "outputs": [],
      "source": [
        "data = load_dataset(\"mongodb-eai/docs\", streaming=True, split=\"train\")\n",
        "data_head = data.take(1000)\n",
        "df = pd.DataFrame(data_head)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>updated</th>\n",
              "      <th>_id</th>\n",
              "      <th>metadata</th>\n",
              "      <th>action</th>\n",
              "      <th>sourceName</th>\n",
              "      <th>body</th>\n",
              "      <th>url</th>\n",
              "      <th>format</th>\n",
              "      <th>title</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>{'$date': '2024-05-20T17:30:49.148Z'}</td>\n",
              "      <td>{'$oid': '664b88c96e4f895074208162'}</td>\n",
              "      <td>{'contentType': None, 'pageDescription': None,...</td>\n",
              "      <td>created</td>\n",
              "      <td>snooty-cloud-docs</td>\n",
              "      <td># View Database Access History\\n\\n- This featu...</td>\n",
              "      <td>https://mongodb.com/docs/atlas/access-tracking/</td>\n",
              "      <td>md</td>\n",
              "      <td>View Database Access History</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>{'$date': '2024-05-20T17:30:49.148Z'}</td>\n",
              "      <td>{'$oid': '664b88c96e4f895074208178'}</td>\n",
              "      <td>{'contentType': None, 'pageDescription': None,...</td>\n",
              "      <td>created</td>\n",
              "      <td>snooty-cloud-docs</td>\n",
              "      <td># Manage Organization Teams\\n\\nYou can create ...</td>\n",
              "      <td>https://mongodb.com/docs/atlas/access/manage-t...</td>\n",
              "      <td>md</td>\n",
              "      <td>Manage Organization Teams</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>{'$date': '2024-05-20T17:30:49.148Z'}</td>\n",
              "      <td>{'$oid': '664b88c96e4f895074208183'}</td>\n",
              "      <td>{'contentType': None, 'pageDescription': None,...</td>\n",
              "      <td>created</td>\n",
              "      <td>snooty-cloud-docs</td>\n",
              "      <td># Manage Organizations\\n\\nIn the organizations...</td>\n",
              "      <td>https://mongodb.com/docs/atlas/access/orgs-cre...</td>\n",
              "      <td>md</td>\n",
              "      <td>Manage Organizations</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>{'$date': '2024-05-20T17:30:49.148Z'}</td>\n",
              "      <td>{'$oid': '664b88c96e4f89507420818f'}</td>\n",
              "      <td>{'contentType': None, 'pageDescription': None,...</td>\n",
              "      <td>created</td>\n",
              "      <td>snooty-cloud-docs</td>\n",
              "      <td># Alert Basics\\n\\nAtlas provides built-in tool...</td>\n",
              "      <td>https://mongodb.com/docs/atlas/alert-basics/</td>\n",
              "      <td>md</td>\n",
              "      <td>Alert Basics</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>{'$date': '2024-05-20T17:30:49.148Z'}</td>\n",
              "      <td>{'$oid': '664b88c96e4f89507420819d'}</td>\n",
              "      <td>{'contentType': None, 'pageDescription': None,...</td>\n",
              "      <td>created</td>\n",
              "      <td>snooty-cloud-docs</td>\n",
              "      <td># Resolve Alerts\\n\\nAtlas issues alerts for th...</td>\n",
              "      <td>https://mongodb.com/docs/atlas/alert-resolutions/</td>\n",
              "      <td>md</td>\n",
              "      <td>Resolve Alerts</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                 updated  \\\n",
              "0  {'$date': '2024-05-20T17:30:49.148Z'}   \n",
              "1  {'$date': '2024-05-20T17:30:49.148Z'}   \n",
              "2  {'$date': '2024-05-20T17:30:49.148Z'}   \n",
              "3  {'$date': '2024-05-20T17:30:49.148Z'}   \n",
              "4  {'$date': '2024-05-20T17:30:49.148Z'}   \n",
              "\n",
              "                                    _id  \\\n",
              "0  {'$oid': '664b88c96e4f895074208162'}   \n",
              "1  {'$oid': '664b88c96e4f895074208178'}   \n",
              "2  {'$oid': '664b88c96e4f895074208183'}   \n",
              "3  {'$oid': '664b88c96e4f89507420818f'}   \n",
              "4  {'$oid': '664b88c96e4f89507420819d'}   \n",
              "\n",
              "                                            metadata   action  \\\n",
              "0  {'contentType': None, 'pageDescription': None,...  created   \n",
              "1  {'contentType': None, 'pageDescription': None,...  created   \n",
              "2  {'contentType': None, 'pageDescription': None,...  created   \n",
              "3  {'contentType': None, 'pageDescription': None,...  created   \n",
              "4  {'contentType': None, 'pageDescription': None,...  created   \n",
              "\n",
              "          sourceName                                               body  \\\n",
              "0  snooty-cloud-docs  # View Database Access History\\n\\n- This featu...   \n",
              "1  snooty-cloud-docs  # Manage Organization Teams\\n\\nYou can create ...   \n",
              "2  snooty-cloud-docs  # Manage Organizations\\n\\nIn the organizations...   \n",
              "3  snooty-cloud-docs  # Alert Basics\\n\\nAtlas provides built-in tool...   \n",
              "4  snooty-cloud-docs  # Resolve Alerts\\n\\nAtlas issues alerts for th...   \n",
              "\n",
              "                                                 url format  \\\n",
              "0    https://mongodb.com/docs/atlas/access-tracking/     md   \n",
              "1  https://mongodb.com/docs/atlas/access/manage-t...     md   \n",
              "2  https://mongodb.com/docs/atlas/access/orgs-cre...     md   \n",
              "3       https://mongodb.com/docs/atlas/alert-basics/     md   \n",
              "4  https://mongodb.com/docs/atlas/alert-resolutions/     md   \n",
              "\n",
              "                          title  \n",
              "0  View Database Access History  \n",
              "1     Manage Organization Teams  \n",
              "2          Manage Organizations  \n",
              "3                  Alert Basics  \n",
              "4                Resolve Alerts  "
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 4: Convert dataset to LangChain Documents"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {},
      "outputs": [],
      "source": [
        "from langchain_core.documents import Document"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {},
      "outputs": [],
      "source": [
        "docs = []\n",
        "metadata_fields = [\"updated\", \"url\", \"title\"]\n",
        "for _, row in df.iterrows():\n",
        "    content = row[\"body\"]\n",
        "    metadata = row[\"metadata\"]\n",
        "    for field in metadata_fields:\n",
        "        metadata[field] = row[field]\n",
        "    docs.append(Document(page_content=content, metadata=metadata))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Document(metadata={'contentType': None, 'pageDescription': None, 'productName': 'MongoDB Atlas', 'tags': ['atlas', 'docs'], 'version': None, 'updated': {'$date': '2024-05-20T17:30:49.148Z'}, 'url': 'https://mongodb.com/docs/atlas/access-tracking/', 'title': 'View Database Access History'}, page_content='# View Database Access History\\n\\n- This feature is not available for `M0` free clusters, `M2`, and `M5` clusters. To learn more, see Atlas M0 (Free Cluster), M2, and M5 Limits.\\n\\n- This feature is not supported on Serverless instances at this time. To learn more, see Serverless Instance Limitations.\\n\\n## Overview\\n\\nAtlas parses the MongoDB database logs to collect a list of authentication requests made against your clusters through the following methods:\\n\\n- `mongosh`\\n\\n- Compass\\n\\n- Drivers\\n\\nAuthentication requests made with API Keys through the Atlas Administration API are not logged.\\n\\nAtlas logs the following information for each authentication request within the last 7 days:\\n\\n<table>\\n<tr>\\n<th id=\"Field\">\\nField\\n\\n</th>\\n<th id=\"Description\">\\nDescription\\n\\n</th>\\n</tr>\\n<tr>\\n<td headers=\"Field\">\\nTimestamp\\n\\n</td>\\n<td headers=\"Description\">\\nThe date and time of the authentication request.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Field\">\\nUsername\\n\\n</td>\\n<td headers=\"Description\">\\nThe username associated with the database user who made the authentication request.\\n\\nFor LDAP usernames, the UI displays the resolved LDAP name. Hover over the name to see the full LDAP username.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Field\">\\nIP Address\\n\\n</td>\\n<td headers=\"Description\">\\nThe IP address of the machine that sent the authentication request.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Field\">\\nHost\\n\\n</td>\\n<td headers=\"Description\">\\nThe target server that processed the authentication request.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Field\">\\nAuthentication Source\\n\\n</td>\\n<td headers=\"Description\">\\nThe database that the authentication request was made against. `admin` is the authentication source for SCRAM-SHA users and `$external` for LDAP users.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Field\">\\nAuthentication Result\\n\\n</td>\\n<td headers=\"Description\">\\nThe success or failure of the authentication request. A reason code is displayed for the failed authentication requests.\\n\\n</td>\\n</tr>\\n</table>Authentication requests are pre-sorted by descending timestamp with 25 entries per page.\\n\\n### Logging Limitations\\n\\nIf a cluster experiences an activity spike and generates an extremely large quantity of log messages, Atlas may stop collecting and storing new logs for a period of time.\\n\\nLog analysis rate limits apply only to the Performance Advisor UI, the Query Insights UI, the Access Tracking UI, and the Atlas Search Query Analytics UI. Downloadable log files are always complete.\\n\\nIf authentication requests occur during a period when logs are not collected, they will not appear in the database access history.\\n\\n## Required Access\\n\\nTo view database access history, you must have `Project Owner` or `Organization Owner` access to Atlas.\\n\\n## Procedure\\n\\n<Tabs>\\n\\n<Tab name=\"Atlas CLI\">\\n\\nTo return the access logs for a cluster using the Atlas CLI, run the following command:\\n\\n```sh\\n\\natlas accessLogs list [options]\\n\\n```\\n\\nTo learn more about the command syntax and parameters, see the Atlas CLI documentation for atlas accessLogs list.\\n\\n- Install the Atlas CLI\\n\\n- Connect to the Atlas CLI\\n\\n</Tab>\\n\\n<Tab name=\"Atlas Administration API\">\\n\\nTo view the database access history using the API, see Access Tracking.\\n\\n</Tab>\\n\\n<Tab name=\"Atlas UI\">\\n\\nUse the following procedure to view your database access history using the Atlas UI:\\n\\n### Navigate to the Clusters page for your project.\\n\\n- If it is not already displayed, select the organization that contains your desired project from the  Organizations menu in the navigation bar.\\n\\n- If it is not already displayed, select your desired project from the Projects menu in the navigation bar.\\n\\n- If the Clusters page is not already displayed, click Database in the sidebar.\\n\\n### View the cluster\\'s database access history.\\n\\n- On the cluster card, click .\\n\\n- Select View Database Access History.\\n\\nor\\n\\n- Click the cluster name.\\n\\n- Click .\\n\\n- Select View Database Access History.\\n\\n</Tab>\\n\\n</Tabs>\\n\\n')"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "docs[0]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "1000"
            ]
          },
          "execution_count": 12,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "len(docs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 5: Instantiate the retriever"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {},
      "outputs": [],
      "source": [
        "from langchain_mongodb.retrievers import (\n",
        "    MongoDBAtlasParentDocumentRetriever,\n",
        ")\n",
        "from langchain_openai import OpenAIEmbeddings\n",
        "from langchain_text_splitters import RecursiveCharacterTextSplitter"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {},
      "outputs": [],
      "source": [
        "embedding_model = OpenAIEmbeddings(model=\"text-embedding-3-small\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {},
      "outputs": [],
      "source": [
        "DB_NAME = \"langchain\"\n",
        "COLLECTION_NAME = \"parent_doc\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {},
      "outputs": [],
      "source": [
        "def get_splitter(chunk_size: int) -> RecursiveCharacterTextSplitter:\n",
        "    \"\"\"\n",
        "    Returns a token-based text splitter with overlap\n",
        "\n",
        "    Args:\n",
        "        chunk_size (_type_): Chunk size in number of tokens\n",
        "\n",
        "    Returns:\n",
        "        RecursiveCharacterTextSplitter: Recursive text splitter object\n",
        "    \"\"\"\n",
        "    return RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
        "        encoding_name=\"cl100k_base\",\n",
        "        chunk_size=chunk_size,\n",
        "        chunk_overlap=0.15 * chunk_size,\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Parent document retriever"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {},
      "outputs": [],
      "source": [
        "parent_doc_retriever = MongoDBAtlasParentDocumentRetriever.from_connection_string(\n",
        "    connection_string=MONGODB_URI,\n",
        "    embedding_model=embedding_model,\n",
        "    child_splitter=get_splitter(200),\n",
        "    database_name=DB_NAME,\n",
        "    collection_name=COLLECTION_NAME,\n",
        "    text_key=\"page_content\",\n",
        "    search_kwargs={\"top_k\": 10},\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# # Parent chunk retriever\n",
        "# parent_chunk_retriever = MongoDBAtlasParentDocumentRetriever.from_connection_string(\n",
        "#     connection_string=MONGODB_URI,\n",
        "#     embedding_model=embedding_model,\n",
        "#     child_splitter=get_splitter(200),\n",
        "#     parent_splitter=get_splitter(800),\n",
        "#     database_name=DB_NAME,\n",
        "#     collection_name=COLLECTION_NAME,\n",
        "#     text_key=\"page_content\",\n",
        "#     search_kwargs={\"top_k\": 10},\n",
        "# )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 6: Ingest documents into MongoDB"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {},
      "outputs": [],
      "source": [
        "import asyncio\n",
        "from typing import Generator, List"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {},
      "outputs": [],
      "source": [
        "BATCH_SIZE = 256\n",
        "MAX_CONCURRENCY = 4"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {},
      "outputs": [],
      "source": [
        "async def process_batch(batch: Generator, semaphore: asyncio.Semaphore) -> None:\n",
        "    \"\"\"\n",
        "    Ingest batches of documents into MongoDB\n",
        "\n",
        "    Args:\n",
        "        batch (Generator): Chunk of documents to ingest\n",
        "        semaphore (as): Asyncio semaphore\n",
        "    \"\"\"\n",
        "    async with semaphore:\n",
        "        await parent_doc_retriever.aadd_documents(batch)\n",
        "        print(f\"Processed {len(batch)} documents\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {},
      "outputs": [],
      "source": [
        "def get_batches(docs: List[Document], batch_size: int) -> Generator:\n",
        "    \"\"\"\n",
        "    Return batches of documents to ingest into MongoDB\n",
        "\n",
        "    Args:\n",
        "        docs (List[Document]): List of LangChain documents\n",
        "        batch_size (int): Batch size\n",
        "\n",
        "    Yields:\n",
        "        Generator: Batch of documents\n",
        "    \"\"\"\n",
        "    for i in range(0, len(docs), batch_size):\n",
        "        yield docs[i : i + batch_size]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {},
      "outputs": [],
      "source": [
        "async def process_docs(docs: List[Document]) -> List[None]:\n",
        "    \"\"\"\n",
        "    Asynchronously ingest LangChain documents into MongoDB\n",
        "\n",
        "    Args:\n",
        "        docs (List[Document]): List of LangChain documents\n",
        "\n",
        "    Returns:\n",
        "        List[None]: Results of the task executions\n",
        "    \"\"\"\n",
        "    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)\n",
        "    batches = get_batches(docs, BATCH_SIZE)\n",
        "\n",
        "    tasks = []\n",
        "    for batch in batches:\n",
        "        tasks.append(process_batch(batch, semaphore))\n",
        "    # Gather results from all tasks\n",
        "    results = await asyncio.gather(*tasks)\n",
        "    return results"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Deletion complete.\n",
            "Processed 256 documents\n",
            "Processed 256 documents\n",
            "Processed 256 documents\n",
            "Processed 232 documents\n"
          ]
        }
      ],
      "source": [
        "collection = mongodb_client[DB_NAME][COLLECTION_NAME]\n",
        "# Delete any existing documents from the collection\n",
        "collection.delete_many({})\n",
        "print(\"Deletion complete.\")\n",
        "# Ingest LangChain documents into MongoDB\n",
        "results = await process_docs(docs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 7: Create a vector search index"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pymongo.errors import OperationFailure\n",
        "from pymongo.operations import SearchIndexModel"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {},
      "outputs": [],
      "source": [
        "VS_INDEX_NAME = \"vector_index\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Vector search index definition\n",
        "model = SearchIndexModel(\n",
        "    definition={\n",
        "        \"fields\": [\n",
        "            {\n",
        "                \"type\": \"vector\",\n",
        "                \"path\": \"embedding\",\n",
        "                \"numDimensions\": 1536,\n",
        "                \"similarity\": \"cosine\",\n",
        "            }\n",
        "        ]\n",
        "    },\n",
        "    name=VS_INDEX_NAME,\n",
        "    type=\"vectorSearch\",\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Successfully created index vector_index for collection parent_doc\n"
          ]
        }
      ],
      "source": [
        "# Check if the index already exists, if not create it\n",
        "try:\n",
        "    collection.create_search_index(model=model)\n",
        "    print(\n",
        "        f\"Successfully created index {VS_INDEX_NAME} for collection {COLLECTION_NAME}\"\n",
        "    )\n",
        "except OperationFailure:\n",
        "    print(\n",
        "        f\"Duplicate index {VS_INDEX_NAME} found for collection {COLLECTION_NAME}. Skipping index creation.\"\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 8: Usage"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### In a RAG application"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "metadata": {},
      "outputs": [],
      "source": [
        "from langchain_core.output_parsers import StrOutputParser\n",
        "from langchain_core.prompts import ChatPromptTemplate\n",
        "from langchain_core.runnables import RunnablePassthrough\n",
        "from langchain_openai import ChatOpenAI"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Retrieve and parse documents\n",
        "retrieve = {\n",
        "    \"context\": parent_doc_retriever\n",
        "    | (lambda docs: \"\\n\\n\".join([d.page_content for d in docs])),\n",
        "    \"question\": RunnablePassthrough(),\n",
        "}\n",
        "template = \"\"\"Answer the question based only on the following context. If no context is provided, respond with I DON't KNOW: \\\n",
        "{context}\n",
        "\n",
        "Question: {question}\n",
        "\"\"\"\n",
        "# Define the chat prompt\n",
        "prompt = ChatPromptTemplate.from_template(template)\n",
        "# Define the model to be used for chat completion\n",
        "llm = ChatOpenAI(temperature=0, model=\"gpt-4o-2024-11-20\")\n",
        "# Parse output as a string\n",
        "parse_output = StrOutputParser()\n",
        "# Naive RAG chain\n",
        "rag_chain = retrieve | prompt | llm | parse_output"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 33,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "To improve slow queries in MongoDB, you can follow these steps:\n",
            "\n",
            "1. **Use the Performance Advisor**:\n",
            "   - The Performance Advisor monitors slow queries and suggests indexes to improve performance.\n",
            "   - Create the suggested indexes, especially those with high Impact scores and low Average Query Targeting scores.\n",
            "\n",
            "2. **Analyze Query Performance**:\n",
            "   - Use the **Query Profiler** to identify slow-running operations and their key performance statistics.\n",
            "   - Use the **Real-Time Performance Panel (RTPP)** to evaluate query execution times and the ratio of documents scanned to documents returned.\n",
            "   - Use **Namespace Insights** to monitor collection-level query latency.\n",
            "\n",
            "3. **Optimize Indexes**:\n",
            "   - Create indexes that support your queries to reduce the time needed to search for results.\n",
            "   - Remove unused or inefficient indexes to improve write performance and free storage space.\n",
            "   - Perform rolling index builds to minimize performance impact on replica sets and sharded clusters.\n",
            "\n",
            "4. **Fix Query Targeting Issues**:\n",
            "   - Address `Query Targeting: Scanned Objects / Returned` or `Query Targeting: Scanned / Returned` alerts by adding indexes to support inefficient queries.\n",
            "   - Use the `cursor.explain()` command to analyze query plans and identify inefficiencies.\n",
            "\n",
            "5. **Follow Best Practices**:\n",
            "   - Avoid creating documents with large array fields that are costly to search and index.\n",
            "   - Optimize queries to take advantage of existing indexes.\n",
            "   - Use the suggested indexes from the Performance Advisor when they align with your indexing strategies.\n",
            "\n",
            "6. **Monitor and Adjust**:\n",
            "   - Use Query Targeting metrics and Query Profiler to monitor progress and ensure query performance improves.\n",
            "   - Adjust the slow query threshold if needed to better suit your workload.\n",
            "\n",
            "By implementing these steps, you can significantly improve the performance of slow queries in MongoDB.\n"
          ]
        }
      ],
      "source": [
        "# Test the RAG chain\n",
        "print(rag_chain.invoke(\"How do I improve slow queries in MongoDB?\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### In an AI agent"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 34,
      "metadata": {},
      "outputs": [],
      "source": [
        "from typing import Annotated, Dict\n",
        "\n",
        "from langchain.agents import tool\n",
        "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
        "from langgraph.graph import END, START, StateGraph\n",
        "from langgraph.graph.message import add_messages\n",
        "from langgraph.prebuilt import ToolNode, tools_condition\n",
        "from typing_extensions import TypedDict"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 35,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Converting the retriever into an agent tool\n",
        "@tool\n",
        "def get_info_about_mongodb(user_query: str) -> str:\n",
        "    \"\"\"\n",
        "    Retrieve information about MongoDB.\n",
        "\n",
        "    Args:\n",
        "    user_query (str): The user's query string.\n",
        "\n",
        "    Returns:\n",
        "    str: The retrieved information formatted as a string.\n",
        "    \"\"\"\n",
        "    docs = parent_doc_retriever.invoke(user_query)\n",
        "    context = \"\\n\\n\".join([d.page_content for d in docs])\n",
        "    return context"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 36,
      "metadata": {},
      "outputs": [],
      "source": [
        "tools = [get_info_about_mongodb]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 37,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Define the LLM to use as the brain of the agent\n",
        "llm = ChatOpenAI(temperature=0, model=\"gpt-4o-2024-11-20\")\n",
        "# Agent prompt\n",
        "prompt = ChatPromptTemplate.from_messages(\n",
        "    [\n",
        "        (\n",
        "            \"You are a helpful AI assistant.\"\n",
        "            \" You are provided with tools to answer questions about MongoDB.\"\n",
        "            \" Think step-by-step and use these tools to get the information required to answer the user query.\"\n",
        "            \" Do not re-run tools unless absolutely necessary.\"\n",
        "            \" If you are not able to get enough information using the tools, reply with I DON'T KNOW.\"\n",
        "            \" You have access to the following tools: {tool_names}.\"\n",
        "        ),\n",
        "        MessagesPlaceholder(variable_name=\"messages\"),\n",
        "    ]\n",
        ")\n",
        "# Partial the prompt with tool names\n",
        "prompt = prompt.partial(tool_names=\", \".join([tool.name for tool in tools]))\n",
        "# Bind tools to LLM\n",
        "llm_with_tools = prompt | llm.bind_tools(tools)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 38,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Define graph state\n",
        "class GraphState(TypedDict):\n",
        "    messages: Annotated[list, add_messages]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 39,
      "metadata": {},
      "outputs": [],
      "source": [
        "def agent(state: GraphState) -> Dict[str, List]:\n",
        "    \"\"\"\n",
        "    Agent node\n",
        "\n",
        "    Args:\n",
        "        state (GraphState): Graph state\n",
        "\n",
        "    Returns:\n",
        "        Dict[str, List]: Updates to the graph state\n",
        "    \"\"\"\n",
        "    messages = state[\"messages\"]\n",
        "    response = llm_with_tools.invoke(messages)\n",
        "    # We return a list, because this will get added to the existing list\n",
        "    return {\"messages\": [response]}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 40,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Convert tools into a graph node\n",
        "tool_node = ToolNode(tools)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 41,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Parameterize the graph with the state\n",
        "graph = StateGraph(GraphState)\n",
        "# Add graph nodes\n",
        "graph.add_node(\"agent\", agent)\n",
        "graph.add_node(\"tools\", tool_node)\n",
        "# Add graph edges\n",
        "graph.add_edge(START, \"agent\")\n",
        "graph.add_edge(\"tools\", \"agent\")\n",
        "graph.add_conditional_edges(\n",
        "    \"agent\",\n",
        "    tools_condition,\n",
        "    {\"tools\": \"tools\", END: END},\n",
        ")\n",
        "# Compile the graph\n",
        "app = graph.compile()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 42,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Node agent:\n",
            "{'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_sifH0mrhbpesQie4BTnQytNk', 'function': {'arguments': '{\"user_query\":\"How do I improve slow queries in MongoDB?\"}', 'name': 'get_info_about_mongodb'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 27, 'prompt_tokens': 165, 'total_tokens': 192, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-11-20', 'system_fingerprint': 'fp_d924043139', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-bc1db263-f4f5-40ba-a6ba-b18a3585e095-0', tool_calls=[{'name': 'get_info_about_mongodb', 'args': {'user_query': 'How do I improve slow queries in MongoDB?'}, 'id': 'call_sifH0mrhbpesQie4BTnQytNk', 'type': 'tool_call'}], usage_metadata={'input_tokens': 165, 'output_tokens': 27, 'total_tokens': 192, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})]}\n",
            "Node tools:\n",
            "{'messages': [ToolMessage(content='# Monitor and Improve Slow Queries\\n\\n*Only available on M10+ clusters and serverless instances*\\n\\nThe Performance Advisor monitors queries that MongoDB considers slow and suggests new indexes to improve query performance. The threshold for slow queries varies based on the average time of operations on your cluster to provide recommendations pertinent to your workload.\\n\\nRecommended indexes are accompanied by sample queries, grouped by query shape, that were run against a collection that would benefit from the suggested index. The Performance Advisor doesn\\'t negatively affect the performance of your Atlas clusters.\\n\\nYou can also monitor collection-level query latency with Namespace Insights and query performance with the Query Profiler.\\n\\nIf the slow query log contains consecutive `$match` stages in the aggregation pipeline, the two stages can coalesce into the first `$match` stage and result in a single `$match` stage. As a result, the query shape in the Performance Advisor might differ from the actual query you ran.\\n\\n## Common Reasons for Slow Queries\\n\\nIf a query is slow, common reasons include:\\n\\n- The query is unsupported by your current indexes.\\n\\n- Some documents in your collection have large array fields that are costly to search and index.\\n\\n- One query retrieves information from multiple collections with $lookup.\\n\\n## Required Access\\n\\nTo view collections with slow queries and see suggested indexes, you must have `Project Read Only` access or higher to the project.\\n\\nTo view field values in a sample query in the Performance Advisor, you must have `Project Data Access Read/Write` access or higher to the project.\\n\\nTo enable or disable the Atlas-managed slow operation threshold, you must have `Project Owner` access to the project. Users with `Organization Owner` access must add themselves to the project as a `Project Owner`.\\n\\n## Configure the Slow Query Threshold\\n\\nBy default, Atlas dynamically adjusts your slow query threshold based on the execution time of operations across your cluster. However, you can opt out of this feature and instead use a fixed slow query threshold of 100 milliseconds. You can disable the Atlas-managed slow operation threshold with the Atlas CLI, Atlas Administration API, or Atlas UI.\\n\\nAtlas clusters with Atlas Search enabled don\\'t support the Atlas-managed slow query operation threshold.\\n\\nFor `M0`, `M2`, `M5` clusters and serverless instances, Atlas disables the Atlas-managed slow query operation threshold by default and you can\\'t enable it.\\n\\n### Disable the Atlas-Managed Slow Operation Threshold\\n\\nBy default, Atlas dynamically adjusts your slow query threshold based on the execution time of operations across your cluster. If you disable the Atlas-managed slow query threshold, it no longer dynamically adjusts. MongoDB defaults the fixed slow query threshold to 100 milliseconds. We don\\'t recommend that you set the fixed slow query threshold lower than 100 milliseconds.\\n\\nTo disable the Atlas-managed slow operation threshold and use a fixed threshold of 100 milliseconds:\\n\\n<Tabs>\\n\\n<Tab name=\"Atlas CLI\">\\n\\nTo disable the Atlas-managed slow operation threshold for your project using the Atlas CLI, run the following command:\\n\\n```sh\\n\\natlas performanceAdvisor slowOperationThreshold disable [options]\\n\\n```\\n\\nTo learn more about the command syntax and parameters, see the Atlas CLI documentation for atlas performanceAdvisor slowOperationThreshold disable.\\n\\n- Install the Atlas CLI\\n\\n- Connect to the Atlas CLI\\n\\n</Tab>\\n\\n<Tab name=\"Atlas Administration API\">\\n\\nSee Disable Managed Slow Operation Threshold.\\n\\n</Tab>\\n\\n<Tab name=\"Atlas UI\">\\n\\nIn the Project Settings for the current project, toggle Managed Slow Operations to Off.\\n\\n</Tab>\\n\\n</Tabs>\\n\\n### Enable the Atlas-Managed Slow Operation Threshold\\n\\nAtlas enables the Atlas-managed slow operation threshold by default. To re-enable the Atlas-managed slow operation threshold that you previously disabled:\\n\\n<Tabs>\\n\\n<Tab name=\"Atlas CLI\">\\n\\nTo enable the Atlas-managed slow operation threshold for your project using the Atlas CLI, run the following command:\\n\\n```sh\\n\\natlas performanceAdvisor slowOperationThreshold enable [options]\\n\\n```\\n\\nTo learn more about the command syntax and parameters, see the Atlas CLI documentation for atlas performanceAdvisor slowOperationThreshold enable.\\n\\n- Install the Atlas CLI\\n\\n- Connect to the Atlas CLI\\n\\n</Tab>\\n\\n<Tab name=\"Atlas Administration API\">\\n\\nSee Enable Managed Slow Operation Threshold.\\n\\n</Tab>\\n\\n<Tab name=\"Atlas UI\">\\n\\nIn the Project Settings for the current project, toggle Managed Slow Operations to On.\\n\\n</Tab>\\n\\n</Tabs>\\n\\n## Index Considerations\\n\\nIndexes improve read performance, but a large number of indexes can negatively impact write performance since indexes must be updated during writes. If your collection already has several indexes, consider this tradeoff of read and write performance when deciding whether to create new indexes. Examine whether a query for such a collection can be modified to take advantage of existing indexes, as well as whether a query occurs often enough to justify the cost of a new index.\\n\\n## Access Performance Advisor\\n\\n<Tabs>\\n\\n<Tab name=\"Atlas CLI\">\\n\\n### View Collections with Slow Queries\\n\\nTo return up to 20 namespaces in `<database>.<collection>` format for collections experiencing slow queries using the Atlas CLI, run the following command:\\n\\n```sh\\n\\natlas performanceAdvisor namespaces list [options]\\n\\n```\\n\\nTo learn more about the command syntax and parameters, see the Atlas CLI documentation for atlas performanceAdvisor namespaces list.\\n\\n- Install the Atlas CLI\\n\\n- Connect to the Atlas CLI\\n\\n### View Slow Query Logs\\n\\nTo return query log line items for slow queries that the Performance Advisor and Query Profiler identify using the Atlas CLI, run the following command:\\n\\n```sh\\n\\natlas performanceAdvisor slowQueryLogs list [options]\\n\\n```\\n\\nTo learn more about the command syntax and parameters, see the Atlas CLI documentation for atlas performanceAdvisor slowQueryLogs list.\\n\\n- Install the Atlas CLI\\n\\n- Connect to the Atlas CLI\\n\\n### View Suggested Indexes\\n\\nTo return suggested indexes for collections experiencing slow queries using the Atlas CLI, run the following command:\\n\\n```sh\\n\\natlas performanceAdvisor suggestedIndexes list [options]\\n\\n```\\n\\nTo learn more about the command syntax and parameters, see the Atlas CLI documentation for atlas performanceAdvisor suggestedIndexes list.\\n\\n- Install the Atlas CLI\\n\\n- Connect to the Atlas CLI\\n\\n</Tab>\\n\\n<Tab name=\"Atlas UI\">\\n\\nTo access the Performance Advisor using the Atlas UI:\\n\\n<Tabs>\\n\\n<Tab name=\"M10+ Clusters\">\\n\\n### Click Database.\\n\\n### Click the replica set where the collection resides.\\n\\nIf the replica set resides in a sharded cluster, first click the sharded cluster containing the replica set.\\n\\n### Click Performance Advisor.\\n\\n### Select a collection from the Collections dropdown.\\n\\n### Select a time period from the Time Range dropdown.\\n\\n</Tab>\\n\\n<Tab name=\"Serverless Instances\">\\n\\n### Click Database.\\n\\n### Click the serverless instance.\\n\\n### Click Performance Advisor.\\n\\n</Tab>\\n\\n</Tabs>\\n\\n</Tab>\\n\\n</Tabs>\\n\\nThe Performance Advisor displays up to 20 query shapes across all collections in the cluster and suggested indexes for those shapes. The Performance Advisor ranks the indexes according to their Impact, which indicates High or Medium based on the total wasted bytes read. To learn more about index ranking, see Review Index Ranking.\\n\\n## Index Suggestions\\n\\nThe Performance Advisor ranks the indexes that it suggests according to their Impact, which indicates High or Medium based on the total wasted bytes read. To learn more about how the Performance Advisor ranks indexes, see Review Index Ranking.\\n\\nTo learn how to create indexes that the Performance Advisor suggests, see Create Suggested Indexes.\\n\\n### Index Metrics\\n\\nEach index that the Performance Advisor suggests contains the following metrics. These metrics apply specifically to queries which would be improved by the index:\\n\\n<table>\\n<tr>\\n<th id=\"Metric\">\\nMetric\\n\\n</th>\\n<th id=\"Description\">\\nDescription\\n\\n</th>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nExecution Count\\n\\n</td>\\n<td headers=\"Description\">\\nNumber of queries executed per hour which would be improved.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nAverage Execution Time\\n\\n</td>\\n<td headers=\"Description\">\\nCurrent average execution time in milliseconds for affected queries.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nAverage Query Targeting\\n\\n</td>\\n<td headers=\"Description\">\\nAverage number of documents read per document returned by affected queries. A higher query targeting score indicates a greater degree of inefficiency. For more information on query targeting, see Query Targeting.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nIn Memory Sort\\n\\n</td>\\n<td headers=\"Description\">\\nCurrent number of affected queries per hour that needed to be sorted in memory.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nAverage Docs Scanned\\n\\n</td>\\n<td headers=\"Description\">\\nAverage number of documents scanned.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nAverage Docs Returned\\n\\n</td>\\n<td headers=\"Description\">\\nAverage number of documents returned.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nAverage Object Size\\n\\n</td>\\n<td headers=\"Description\">\\nAverage object size.\\n\\n</td>\\n</tr>\\n</table>\\n\\n### Sample Queries\\n\\nFor each suggested index, the Performance Advisor shows the most commonly executed query shapes that the index would improve. For each query shape, the Performance Advisor displays the following metrics:\\n\\n<table>\\n<tr>\\n<th id=\"Metric\">\\nMetric\\n\\n</th>\\n<th id=\"Description\">\\nDescription\\n\\n</th>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nExecution Count\\n\\n</td>\\n<td headers=\"Description\">\\nNumber of queries executed per hour which match the query shape.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nAverage Execution Time\\n\\n</td>\\n<td headers=\"Description\">\\nAverage execution time in milliseconds for queries which match the query shape.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nAverage Query Targeting\\n\\n</td>\\n<td headers=\"Description\">\\nAverage number of documents read for every document returned by matching queries. A higher query targeting score indicates a greater degree of inefficiency. For more information on query targeting, see Query Targeting.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nAverage Docs Scanned\\n\\n</td>\\n<td headers=\"Description\">\\nAverage number of documents scanned.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nAverage Docs Returned\\n\\n</td>\\n<td headers=\"Description\">\\nAverage number of documents returned.\\n\\n</td>\\n</tr>\\n</table>The Performance Advisor also shows each executed sample query that matches the query shape, with specific metrics for that query.\\n\\n### Query Targeting\\n\\nEach index suggestion includes an Average Query Targeting score indicating how many documents were read for every document returned for the index\\'s corresponding query shapes. A score of 1 represents very efficient query shapes because every document read matched the query and was returned with the query results. All suggested indexes represent an opportunity to improve query performance.\\n\\n### Filter Index Suggestions\\n\\nBy default, the Performance Advisor suggests indexes for all clusters in the deployment. To only show suggested indexes from a specific collection, use the Collection dropdown at the top of the Performance Advisor.\\n\\nYou can also adjust the time range the Performance Advisor takes into account when suggesting indexes by using the Time Range dropdown at the top of the Performance Advisor.\\n\\n### Limitations of Index Suggestions\\n\\n#### Timestamp Format\\n\\nThe Performance Advisor can\\'t suggest indexes for MongoDB databases configured to use the `ctime` timestamp format. As a workaround, set the timestamp format for such databases to either `iso8601-utc` or `iso8601-local`. To learn more about timestamp formats, see mongod --timeStampFormat.\\n\\n#### Log Size\\n\\nThe Performance Advisor analyzes up to 200,000 of your cluster\\'s most recent log lines.\\n\\n#### Log Quantity\\n\\nIf a cluster experiences an activity spike and generates an extremely large quantity of log messages, Atlas may stop collecting and storing new logs for a period of time.\\n\\nLog analysis rate limits apply only to the Performance Advisor UI, the Query Insights UI, the Access Tracking UI, and the Atlas Search Query Analytics UI. Downloadable log files are always complete.\\n\\n#### Time-Series Collections\\n\\nThe Performance Advisor doesn\\'t provide performance suggestions for time-series collections.\\n\\n#### User Feedback\\n\\nThe Performance Advisor includes a user feedback button for Index Suggestions. Atlas hides this button for serverless instances.\\n\\n## Create Suggested Indexes\\n\\nYou can create indexes suggested by the Performance Advisor directly within the Performance Advisor itself. When you create indexes, keep the ratio of reads to writes on the target collection in mind. Indexes come with a performance cost, but are more than worth the cost for frequent queries on large data sets. To learn more about indexing strategies, see Indexing Strategies.\\n\\n### Behavior and Limitations\\n\\n- You can\\'t create indexes through the Performance Advisor if Data Explorer is disabled for your project. You can still view the Performance Advisor recommendations, but you must create those indexes from `mongosh`.\\n\\n- You can only create one index at a time through the Performance Advisor. If you want to create more simultaneously, you can do so using the Atlas UI, a driver, or the shell\\n\\n- Atlas always creates indexes for entire clusters. If you create an index while viewing the Performance Advisor for a single shard in a sharded cluster, Atlas creates that index for the entire sharded cluster.\\n\\n### Procedure\\n\\nTo create a suggested index:\\n\\n#### For the index you want to create, click Create Index.\\n\\nThe Performance Advisor opens the Create Index dialog and prepopulates the Fields based on the index you selected.\\n\\n#### *(Optional)* Specify the index options.\\n\\n```javascript\\n{ <option1>: <value1>, ... }\\n```\\n\\nThe following options document specifies the `unique` option and the `name` for the index:\\n\\n```javascript\\n{ unique: true, name: \"myUniqueIndex\" }\\n```\\n\\n#### *(Optional)* Set the Collation options.\\n\\nUse collation to specify language-specific rules for string comparison, such as rules for lettercase and accent marks. The collation document contains a `locale` field which indicates the ICU Locale code, and may contain other fields to define collation behavior.\\n\\nThe following collation option document specifies a locale value of `fr` for a French language collation:\\n\\n```json\\n{ \"locale\": \"fr\" }\\n```\\n\\nTo review the list of locales that MongoDB collation supports, see the list of languages and locales. To learn more about collation options, including which are enabled by default for each locale, see Collation in the MongoDB manual.\\n\\n#### *(Optional)* Enable building indexes in a rolling fashion.\\n\\nRolling index builds succeed only when they meet certain conditions. To ensure your index build succeeds, avoid the following design patterns that commonly trigger a restart loop:\\n\\n- Index key exceeds the index key limit\\n\\n- Index name already exists\\n\\n- Index on more than one array field\\n\\n- Index on collection that has the maximum number of text indexes\\n\\n- Text index on collection that has the maximum number of text indexes\\n\\nthe Atlas UI doesn\\'t support building indexes with a rolling build for `M0` free clusters and `M2/M5` shared clusters. You can\\'t build indexes with a rolling build for serverless instances.\\n\\nFor workloads which cannot tolerate performance decrease due to index builds, consider building indexes in a rolling fashion.\\n\\nTo maintain cluster availability:\\n\\n- Atlas removes one node from the cluster at a time starting with a secondary.\\n\\n- More than one node can go down at a time, but Atlas always keeps a majority of the nodes online.\\n\\nAtlas automatically cancels rolling index builds that don\\'t succeed on all nodes. When a rolling index build completes on some nodes, but fails on others, Atlas cancels the build and removes the index from any nodes that it was successfully built on.\\n\\nIn the event of a rolling index build cancellation, Atlas generates an activity feed event and sends a notification email to the project owner with the following information:\\n\\n- Name of the cluster on which the rolling index build failed\\n\\n- Namespace on which the rolling index build failed\\n\\n- Project that contains the cluster and namespace\\n\\n- Organization that contains the project\\n\\n- Link to the activity feed event\\n\\nTo learn more about rebuilding indexes, see Build Indexes on Replica Sets.\\n\\nUnique\\nindex options are incompatible with building indexes in a rolling fashion. If you specify `unique` in the Options pane, Atlas rejects your configuration with an error message.\\n\\n#### Click Review.\\n\\n#### In the Confirm Operation dialog, confirm your index.\\n\\nWhen an index build completes, Atlas generates an activity feed event and sends a notification email to the project owner with the following information:\\n\\n- Completion date of the index build\\n\\n- Name of the cluster on which the index build completed\\n\\n- Namespace on which the index build completed\\n\\n- Project containing the cluster and namespace\\n\\n- Organization containing the project\\n\\n- Link to the activity feed event\\n\\n\\n\\n# Fix Query Issues\\n\\n`Query Targeting` alerts often indicate inefficient queries.\\n\\n## Alert Conditions\\n\\nYou can configure the following alert conditions in the project-level alert settings page to trigger alerts.\\n\\n`Query Targeting: Scanned Objects / Returned` alerts are triggered when the average number of documents scanned relative to the average number of documents returned server-wide across all operations during a sampling period exceeds a defined threshold. The default alert uses a 1000:1 threshold.\\n\\nIdeally, the ratio of scanned documents to returned documents should be close to 1. A high ratio negatively impacts query performance.\\n\\n`Query Targeting: Scanned / Returned` occurs if the number of index keys examined to fulfill a query relative to the actual number of returned documents meets or exceeds a user-defined threshold. This alert is not enabled by default.\\n\\nThe following mongod log entry shows statistics generated from an inefficient query:\\n\\n```json\\n<Timestamp> COMMAND  <query>\\nplanSummary: COLLSCAN keysExamined:0\\ndocsExamined: 10000 cursorExhausted:1 numYields:234\\nnreturned:4  protocol:op_query 358ms\\n```\\n\\nThis query scanned 10,000 documents and returned only 4 for a ratio of 2500, which is highly inefficient. No index keys were examined, so MongoDB scanned all documents in the collection, known as a collection scan.\\n\\n## Common Triggers\\n\\nThe query targeting alert typically occurs when there is no index to support a query or queries or when an existing index only partially supports a query or queries.\\n\\nThe change streams cursors that the Atlas Search process (`mongot`) uses to keep Atlas Search indexes updated can contribute to the query targeting ratio and trigger query targeting alerts if the ratio is high.\\n\\n## Fix the Immediate Problem\\n\\nAdd one or more indexes to better serve the inefficient queries.\\n\\nThe Performance Advisor provides the easiest and quickest way to create an index. The Performance Advisor monitors queries that MongoDB considers slow and recommends indexes to improve performance. Atlas dynamically adjusts your slow query threshold based on the execution time of operations across your cluster.\\n\\nClick Create Index on a slow query for instructions on how to create the recommended index.\\n\\nIt is possible to receive a Query Targeting alert for an inefficient query without receiving index suggestions from the Performance Advisor if the query exceeds the slow query threshold and the ratio of scanned to returned documents is greater than the threshold specified in the alert.\\n\\nIn addition, you can use the following resources to determine which query generated the alert:\\n\\n- The Real-Time Performance Panel monitors and displays current network traffic and database operations on machines hosting MongoDB in your Atlas clusters.\\n\\n- The MongoDB logs maintain an account of activity, including queries, for each `mongod` instance in your Atlas clusters.\\n\\n- The cursor.explain() command for `mongosh` provides performance details for all queries.\\n\\n- Namespace Insights monitors collection-level query latency.\\n\\n- The Atlas Query Profiler records operations that Atlas considers slow when compared to average execution time for all operations on your cluster.\\n\\n## Implement a Long-Term Solution\\n\\nRefer to the following for more information on query performance:\\n\\n- MongoDB Indexing Strategies\\n\\n- Query Optimization\\n\\n- Analyze Query Plan\\n\\n## Monitor Your Progress\\n\\nAtlas provides the following methods to visualize query targeting:\\n\\n- Query Targeting metrics, which highlight high ratios of objects scanned to objects returned.\\n\\n- Namespace Insights, which monitors collection-level query latency.\\n\\n- The Query Profiler, which describes specific inefficient queries executed on the cluster.\\n\\n### Query Targeting Metrics\\n\\nYou can view historical metrics to help you visualize the query performance of your cluster. To view Query Targeting metrics in the Atlas UI:\\n\\n1. Click Database in the top-left corner of Atlas.\\n\\n2. Click View Monitoring on the dashboard for the cluster.\\n\\n3. On the Metrics page, click the Add Chart dropdown menu and select Query Targeting.\\n\\nThe Query Targeting chart displays the following metrics for queries executed on the server:\\n\\n<table>\\n<tr>\\n<th id=\"Metric\">\\nMetric\\n\\n</th>\\n<th id=\"Description\">\\nDescription\\n\\n</th>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nScanned Objects / Returned\\n\\n</td>\\n<td headers=\"Description\">\\nIndicates the average number of documents examined relative to the average number of returned documents.\\n\\n</td>\\n</tr>\\n<tr>\\n<td headers=\"Metric\">\\nScanned / Returned\\n\\n</td>\\n<td headers=\"Description\">\\nIndicates the number of index keys examined to fulfill a query relative to the actual number of returned documents.\\n\\n</td>\\n</tr>\\n</table>The change streams cursors that the Atlas Search process (`mongot`) uses to keep Atlas Search indexes updated can contribute to the query targeting ratio and trigger query targeting alerts if the ratio is high.\\n\\nIf either of these metrics exceed the user-defined threshold, Atlas generates the corresponding `Query Targeting: Scanned Objects / Returned` or `Query Targeting: Scanned / Returned` alert.\\n\\nYou can also view Query Targeting ratios of operations in real-time using the Real-Time Performance Panel.\\n\\n### Namespace Insights\\n\\nNamespace Insights monitors collection-level query latency. You can view query latency metrics and statistics for certain hosts and operation types. Manage pinned namespaces and choose up to five namespaces to show in the corresponding query latency charts.\\n\\nTo access Namespace Insights:\\n\\n1. Click Database in the top-left corner of Atlas.\\n\\n2. Click View Monitoring on the dashboard for the cluster.\\n\\n3. Click the Query Insights tab.\\n\\n4. Click the Namespace Insights tab.\\n\\n### Query Profiler\\n\\nThe Query Profiler contains several metrics you can use to pinpoint specific inefficient queries. You can visualize up to the past 24 hours of query operations. The Query Profiler can show the Examined : Returned Ratio (index keys examined to documents returned) of logged queries, which might help you identify the queries that triggered a `Query Targeting: Scanned / Returned` alert. The chart shows the number of index keys examined to fulfill a query relative to the actual number of returned documents.\\n\\nThe default\\n`Query Targeting: Scanned Objects / Returned` alert ratio differs slightly. The ratio of the average number of documents scanned to the average number of documents returned during a sampling period triggers this alert.\\n\\nAtlas might not log the individual operations that contribute to the Query Targeting ratios due to automatically set thresholds. However, you can still use the Query Profiler and Query Targeting metrics to analyze and optimize query performance.\\n\\nTo access the Query Profiler:\\n\\n1. Click Database in the top-left corner of Atlas.\\n\\n2. Click View Monitoring on the dashboard for the cluster.\\n\\n3. Click the Query Insights tab.\\n\\n4. Click the Query Profiler tab.\\n\\n\\n\\n# Analyze Slow Queries\\n\\nAtlas provides several tools to help analyze slow queries executed on your clusters. See the following sections for descriptions of each tool. To optimize your query performance, review the best practices for query performance.\\n\\n## Performance Advisor\\n\\nThe Performance Advisor monitors queries that MongoDB considers slow and suggests new indexes to improve query performance.\\n\\nYou can use the Performance Advisor to review the following information:\\n\\n- Index Ranking\\n\\n- Drop Index Recommendations\\n\\n## Namespace Insights\\n\\nMonitor collection-level query latency with Namespace Insights. You can view query latency metrics and statistics for certain hosts and operation types. Manage pinned namespaces and choose up to five namespaces to show in the corresponding query latency charts.\\n\\n## Query Profiler\\n\\nThe Query Profiler displays slow-running operations and their key performance statistics. You can explore a sample of historical queries for up to the last 24 hours without additional cost or performance overhead. Before you enable the Query Profiler, see Considerations.\\n\\n## Real-Time Performance Panel (RTPP)\\n\\nThe Real-Time Performance Panel identifies relevant database operations, evaluates query execution times, and shows the ratio of documents scanned to documents returned during query execution. RTPP (Real-Time Performance Panel) is enabled by default.\\n\\nTo enable or disable Real-Time Performance Panel for a project, you must have the `Project Owner` role for the project.\\n\\n## Best Practices for Query Performance\\n\\nTo optimize query performance, review the following best practices:\\n\\n- Create queries that your current indexes support to reduce the time needed to search for your results.\\n\\n- Avoid creating documents with large array fields that require a lot of processing to search and index.\\n\\n- Optimize your indexes and remove unused or inefficent indexes. Too many indexes can negatively impact write performance.\\n\\n- Consider the suggested indexes from the Performance Advisor with the highest Impact scores and lowest Average Query Targeting scores.\\n\\n- Create the indexes that the Performance Advisor suggests when they align with your Indexing Strategies.\\n\\n- The Performance Advisor cannot suggest indexes for MongoDB databases configured to use the ctime timestamp format. As a workaround, set the timestamp format for such databases to either iso8601-utc or iso8601-local.\\n\\n- Perform rolling index builds to reduce the performance impact of building indexes on replica sets and sharded clusters.\\n\\n- Drop unused, redundant, and hidden indexes to improve write performance and free storage space.\\n\\n', name='get_info_about_mongodb', id='58b5fb08-1776-49d8-a6f6-956431f77388', tool_call_id='call_sifH0mrhbpesQie4BTnQytNk')]}\n",
            "Node agent:\n",
            "{'messages': [AIMessage(content=\"To improve slow queries in MongoDB, you can follow these steps:\\n\\n### 1. **Analyze the Problem**\\n   - Use the **Performance Advisor** to monitor slow queries and get index recommendations.\\n   - Check the **Query Profiler** to identify slow-running operations and their key performance statistics.\\n   - Use **Namespace Insights** to monitor collection-level query latency.\\n   - Analyze the **Real-Time Performance Panel (RTPP)** for real-time query execution metrics.\\n\\n### 2. **Common Causes of Slow Queries**\\n   - Queries are not supported by existing indexes.\\n   - Large array fields in documents that are costly to search and index.\\n   - Queries involving multiple collections using `$lookup`.\\n\\n### 3. **Fix Immediate Issues**\\n   - **Add Indexes**: Create indexes to support inefficient queries. The Performance Advisor provides suggestions for indexes with high impact.\\n   - **Optimize Queries**: Ensure queries are designed to utilize existing indexes effectively.\\n   - **Avoid Collection Scans**: If a query scans all documents in a collection (COLLSCAN), it indicates the need for an index.\\n\\n### 4. **Long-Term Solutions**\\n   - **Optimize Indexes**: Remove unused or redundant indexes to improve write performance.\\n   - **Monitor Query Targeting**: Keep the ratio of documents scanned to documents returned close to 1.\\n   - **Avoid Large Arrays**: Minimize the use of large array fields in documents.\\n\\n### 5. **Best Practices**\\n   - Use the **Query Targeting Metrics** to identify inefficiencies.\\n   - Perform **rolling index builds** to minimize performance impact on replica sets and sharded clusters.\\n   - Drop unused or hidden indexes to free up storage and improve write performance.\\n\\n### 6. **Tools for Monitoring and Optimization**\\n   - **Performance Advisor**: Suggests indexes and provides query insights.\\n   - **Query Profiler**: Displays slow-running queries and their statistics.\\n   - **Namespace Insights**: Monitors query latency at the collection level.\\n   - **Real-Time Performance Panel**: Provides real-time metrics for query execution.\\n\\nBy following these steps and utilizing MongoDB's built-in tools, you can significantly improve the performance of slow queries.\", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 451, 'prompt_tokens': 5355, 'total_tokens': 5806, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-11-20', 'system_fingerprint': 'fp_d924043139', 'finish_reason': 'stop', 'logprobs': None}, id='run-ad3b553a-e5e6-4c9e-9246-d6e0f7286abb-0', usage_metadata={'input_tokens': 5355, 'output_tokens': 451, 'total_tokens': 5806, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})]}\n",
            "---FINAL ANSWER---\n",
            "To improve slow queries in MongoDB, you can follow these steps:\n",
            "\n",
            "### 1. **Analyze the Problem**\n",
            "   - Use the **Performance Advisor** to monitor slow queries and get index recommendations.\n",
            "   - Check the **Query Profiler** to identify slow-running operations and their key performance statistics.\n",
            "   - Use **Namespace Insights** to monitor collection-level query latency.\n",
            "   - Analyze the **Real-Time Performance Panel (RTPP)** for real-time query execution metrics.\n",
            "\n",
            "### 2. **Common Causes of Slow Queries**\n",
            "   - Queries are not supported by existing indexes.\n",
            "   - Large array fields in documents that are costly to search and index.\n",
            "   - Queries involving multiple collections using `$lookup`.\n",
            "\n",
            "### 3. **Fix Immediate Issues**\n",
            "   - **Add Indexes**: Create indexes to support inefficient queries. The Performance Advisor provides suggestions for indexes with high impact.\n",
            "   - **Optimize Queries**: Ensure queries are designed to utilize existing indexes effectively.\n",
            "   - **Avoid Collection Scans**: If a query scans all documents in a collection (COLLSCAN), it indicates the need for an index.\n",
            "\n",
            "### 4. **Long-Term Solutions**\n",
            "   - **Optimize Indexes**: Remove unused or redundant indexes to improve write performance.\n",
            "   - **Monitor Query Targeting**: Keep the ratio of documents scanned to documents returned close to 1.\n",
            "   - **Avoid Large Arrays**: Minimize the use of large array fields in documents.\n",
            "\n",
            "### 5. **Best Practices**\n",
            "   - Use the **Query Targeting Metrics** to identify inefficiencies.\n",
            "   - Perform **rolling index builds** to minimize performance impact on replica sets and sharded clusters.\n",
            "   - Drop unused or hidden indexes to free up storage and improve write performance.\n",
            "\n",
            "### 6. **Tools for Monitoring and Optimization**\n",
            "   - **Performance Advisor**: Suggests indexes and provides query insights.\n",
            "   - **Query Profiler**: Displays slow-running queries and their statistics.\n",
            "   - **Namespace Insights**: Monitors query latency at the collection level.\n",
            "   - **Real-Time Performance Panel**: Provides real-time metrics for query execution.\n",
            "\n",
            "By following these steps and utilizing MongoDB's built-in tools, you can significantly improve the performance of slow queries.\n"
          ]
        }
      ],
      "source": [
        "# Execute the agent and view outputs\n",
        "inputs = {\n",
        "    \"messages\": [\n",
        "        (\"user\", \"How do I improve slow queries in MongoDB?\"),\n",
        "    ]\n",
        "}\n",
        "\n",
        "for output in app.stream(inputs):\n",
        "    for key, value in output.items():\n",
        "        print(f\"Node {key}:\")\n",
        "        print(value)\n",
        "print(\"---FINAL ANSWER---\")\n",
        "print(value[\"messages\"][-1].content)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.1"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "state": {}
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
